COS 340: Reasoning About Computation∗ Sketching

Author

  • Moses Charikar
Abstract

The emergence of the web and the explosion of digital data of various forms have created the need for very efficient algorithms to deal with large data sets. A search engine is a good example of an application that needs to efficiently process very large amounts of data. Think about processing query logs to compute statistical information about the queries (e.g., the number of distinct queries), or processing the collection of documents crawled by the search engine to identify and eliminate near-duplicate documents. Both of these problems can be readily solved if the data set is relatively small, for example, if it fits in main memory. However, the sheer size of the data set renders most common approaches infeasible.

In the case of computing the number of distinct queries, the obvious technique that comes to mind is to make one pass over the query log and record the set of distinct queries seen so far (e.g., by storing them in a hash table). This can then be used to report the number of distinct elements. A search engine like Google probably receives close to a billion queries every day. The hash table approach is not a very practical solution in this setting. How would you ever hope to scale this solution to compute the number of distinct queries in a year?

An emerging area in Computer Science is that of streaming algorithms, referring to very efficient algorithms that process very large amounts of data by making one or more passes over the data and recording very compact summaries, or sketches, of the data. In constructing such compact summaries, we discard most of the data set but retain some valuable information about it that still enables us to perform the computation we wanted. Ideally, we would like to be able to guarantee that such sketches allow us to compute the same things that could be computed with the entire data set. In most cases, this is too good to be true and we must settle for an approximation of what we wanted to compute. However, there is usually a tradeoff between the size of the sketch and the quality of the approximation it supports. In many cases of interest, by choosing our sketch sizes sufficiently large, we can ensure that the quantities of interest can be computed to any specified degree of accuracy.
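To make the contrast in the abstract concrete, here is a minimal Python sketch that puts the hash table baseline next to one possible small-memory summary. The summary used is a "k minimum values" estimator, chosen here purely for illustration; the abstract does not say which sketch the notes develop. The function names, the use of SHA-1 as the hash, and the default k = 256 are all assumptions of this example.

```python
import hashlib
import heapq


def exact_distinct(stream):
    """Baseline from the abstract: remember every distinct item seen so far."""
    seen = set()  # memory grows with the number of distinct items
    for item in stream:
        seen.add(item)
    return len(seen)


def _hash_unit(item):
    """Map a string to a pseudo-random float in the unit interval (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k=256):
    """Estimate the distinct count in one pass, keeping only the k smallest hash values."""
    heap = []     # negated hashes, so heap[0] holds the largest of the kept values
    kept = set()  # hashes currently kept, used to ignore repeated items
    for item in stream:
        h = _hash_unit(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is among the k smallest: evict the current largest.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen, so the count is exact
    # If n distinct hashes are spread uniformly over the unit interval,
    # the k-th smallest sits near k / n, which gives the estimate below.
    return int((k - 1) / -heap[0])


if __name__ == "__main__":
    # Hypothetical query log: 20,000 queries drawn from 5,000 distinct strings.
    log = [f"query-{i % 5000}" for i in range(20000)]
    print("exact:", exact_distinct(log))
    print("estimate:", kmv_estimate(log, k=256))
```

The baseline stores every distinct query, while the estimator stores at most k numbers regardless of the length of the log. Raising k enlarges the summary and tightens the estimate, which is the size versus accuracy tradeoff the abstract describes.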


Similar resources

COS 340: Reasoning About Computation∗ Hashing

A hash table is a commonly used data structure to store a set of items, allowing fast inserts, lookups and deletes. Every item consists of a unique identifier called a key and a piece of information. For example, the key might be a Social Security Number, a driver’s licence number, or an employee ID number. For our purposes, we focus only on the key. Recall that the operations we would like to ...


Using Analogy to Cluster Hand-Drawn Sketches for Sketch-Based Educational Software

Sketching and drawing are valuable tools for communicating conceptual and spatial information. When people communicate spatial ideas with each other, drawings and diagrams are highly effective because they lighten working memory load and make spatial inference easier (Larkin and Simon 1987). Visual representations may also be helpful for communicating abstract ideas, even when th...


From Visual to Logical Representation A GIS-Based Sketching Tool for Reasoning about Plans

Abstract: Multi-modal and heterogeneous logic reasoning is of increasing importance within the AI community. The GIS based ArcView COA Sketcher (ArCS) sketch and translation tool developed under DARPA’s High Performance Knowledge Bases program is an example of an enabling tool towards that goal. Army Course of Action (COA) sketches can be drawn and translated automatically into statements in a ...


Handout 1: Mathematical Background

This is a brief review of some mathematical tools, especially probability theory, that we will use. This material is mostly from discrete math (COS 340/341) but is also taught in many other courses. Some good sources for this material are the lecture notes by Papadimitriou and Vazirani (see home page of Umesh Vazirani), Lehman and Leighton (see home page of Eric Lehman, Chapters 18 to 24 are pa...


COS 340: Reasoning About Computation∗ Game Theory and Linear Programming

A two-player game (or more correctly, a two-player normal-form game) is specified by two m × n payoff matrices R and C corresponding to the row and column player respectively. Each of these matrices has m rows corresponding to the m strategies of the row player and n columns corresponding to the n strategies of the column player. The row player picks a row i ∈ [m], and the column player picks a ...



Publication date: 2011